-
Notifications
You must be signed in to change notification settings - Fork 3k
API: Define RepairManifests action interface #10784
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
API: Define RepairManifests action interface #10784
Conversation
| * An action that will repair manifests. Implementations should produce a new set of manifests where | ||
| * duplicate entries are removed and entries which refer to files which no longer exist are removed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The interface is prescriptive on what implementations should do in terms of duplicate entries and what implementations should do when files which no longer exist. After some thought I think it's right to define that but I can see arguments where someone wants implementations to just fail in case there are entries which refer to files which no longer exist (Note: I'm specifically talking about cases where implementations cannot recover those files). To me though that can just be an additional validation API someone adds down the line
Also I think at some level the interface needs to be prescriptive to have a clear defined Result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I think we may be missing a few other things here and I'm not sure if they should all happen on every invocation. In my mind
- Read metadata from data files and recompute values in manifests
- Remove Manifests or files with FNF
- Removing duplicates
I think we probably also want to be able to do a dry run of this? Something that checks these 3 things but doesn't actually commit any changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NVM (dry run is in the api below)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @amogh-jahagirdar for this. Yea Russell's list sounds good, optionally, I guess we can put the options in later when we implement them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RussellSpitzer @szehon-ho I largely wanted to avoid a user having to think through which options they need to specify, and make sure implementations have sane defaults but I think I agree at least at the actions API level we could have separate configuraiton methods. When the procedure implementation is there, users could pass in what they want and the procedure implementation has some sane defaults (power users who use the raw actions APIs would have to set every option they want explicitly). cc @danielcweeks in case he had any thoughts.
| Iterable<ManifestFile> addedManifests(); | ||
|
|
||
| /** Returns the number of duplicate files removed */ | ||
| long duplicateFilesRemoved(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be more useful if this was an Iterable of the actual duplicate files? same with the missing files below?
TBH, it didn't seem useful to me because someone who is running repair probably wants to know if the table was corrected and to which extent was it corrected, rather than the individual files....but lmk what you think @szehon-ho @RussellSpitzer @danielcweeks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kind of feel like we do want the iterable here but the procedure could narrow this to the count in the result there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont have a strong preference either way, but if we dont expect millions, it makes sense to have iterable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, I think we can just put the Iterable here. Hopefully in the average repair case, it shouldn't really be millions of duplicate files?
| * | ||
| * @return this for method chaining | ||
| */ | ||
| RepairManifests dryRun(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be rather part of the actual Spark procedure rather than in the API? As I'm not sure how one would examine the results of a dry run in this case. For example, RemoveOrphanFilesProcedure has a dry run option part of the procedure, but not as part of the DeleteOrphanFiles API
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be part of "apply()" which should already exist because this inherits PendingUpdate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be part of "apply()" which should already exist because this inherits PendingUpdate?
@RussellSpitzer this inherits SnapshotUpdate in the Actions API, there's no apply() in that.
I think you're thinking of the other SnapshotUpdate that's part of the "normal" API. This brings up a good point of should this also have an interface there or not. RewriteManifests is defined in the regular API and then there's also an action for that.
The purpose of the Actions API is to enable engines to provide distributed/other custom implementations of procedures, and I think here we could just start by having it as an action and if there's a desire to add it to the typical API we could do it at that time?
should this be rather part of the actual Spark procedure rather than in the API? As I'm not sure how one would examine the results of a dry run in this case. For example, RemoveOrphanFilesProcedure has a dry run option part of the procedure, but not as part of the DeleteOrphanFiles API
Good point, I missed that this is what orphan file removal does. If I think through it a bit more: by defining dryRun() at the interface level, what we're saying is implementations must have a way of supporting that (the implementation could ignore the option technically but that would be a bad implementation of the interface).
If we look at what orphan file does, the action interface has a "deleteWith" API and the procedure will pass in a no-op deletion function.
What we could do is have an action API "commitRepairWith()" which will accept some sort of commit function to perform the manifest rewrite or if the procedure is a dry-run, some no-op commit.
It feels like it complicates the API more to do that.
I think the real fix is to return Result as part of the dryRun...let me double check if this is feasible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I think it's good as it is @nastra. If a user wants to get the results, in the current API they'd do:
RepairManifests.Result result = actionsProvider.repairManifests(table).repairFileEntries().repairEntryStats().dryRun().execute();
// Do whatever with result
| * An action that will repair manifests. Implementations should produce a new set of manifests where | ||
| * duplicate entries are removed and entries which refer to files which no longer exist are removed. | ||
| */ | ||
| public interface RepairManifests extends SnapshotUpdate<RepairManifests, RepairManifests.Result> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this maybe extend Action like the other actions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think SnapshotUpdate is the correct one to extend, here's my thinking:
1.) SnapshotUpdate already extends Action
2.) I think implementations necessarily will produce snapshots when they perform repairs (putting aside the dry run API option). The new snapshot that would be produced, at least in my mind in the reference implementation would be essentially a "rewrite manifests"-like snapshot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah nvm, this seems to be correct. I thought it's extending org.apache.iceberg.SnapshotUpdate while it's actually extending org.apache.iceberg.actions.SnapshotUpdate
szehon-ho
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @amogh-jahagirdar for pushing this forward!
| * An action that will repair manifests. Implementations should produce a new set of manifests where | ||
| * duplicate entries are removed and entries which refer to files which no longer exist are removed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @amogh-jahagirdar for this. Yea Russell's list sounds good, optionally, I guess we can put the options in later when we implement them.
| Iterable<ManifestFile> addedManifests(); | ||
|
|
||
| /** Returns the number of duplicate files removed */ | ||
| long duplicateFilesRemoved(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont have a strong preference either way, but if we dont expect millions, it makes sense to have iterable?
4b2ddfe to
83ca336
Compare
| * | ||
| * @return this for method chaining | ||
| */ | ||
| RepairManifests dryRun(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I think it's good as it is @nastra. If a user wants to get the results, in the current API they'd do:
RepairManifests.Result result = actionsProvider.repairManifests(table).repairFileEntries().repairEntryStats().dryRun().execute();
// Do whatever with result
| RepairManifests repairEntryStats(); | ||
|
|
||
| /** | ||
| * Configuration method for removing duplicate file entries and removing files which no longer | ||
| * exist in storage | ||
| */ | ||
| RepairManifests repairFileEntries(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the previous discussion @szehon-ho @RussellSpitzer the API now has 2 configuration methods for repairing file entries and reparing entry stats. The first one encompasses both the duplicate + missing case (I don't think there's a need to differentiate those options)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, for completness I wonder if we need to dinstinguish what to repair.. ie, file_size_in_bytes vs metrics ,etc. Can just read the filesystem metadata and save from reading the footer, for example. Probably it is too fine grained, and this option is fine?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like if we're going through the process of hitting the file (to check size or read footer), we should just repair everything. I can't see a case where you would want to rewrite the information but possibly carryover incorrect information.
| RepairManifests repairEntryStats(); | ||
|
|
||
| /** | ||
| * Configuration method for removing duplicate file entries and removing files which no longer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo?
| RepairManifests repairEntryStats(); | ||
|
|
||
| /** | ||
| * Configuration method for removing duplicate file entries and removing files which no longer | ||
| * exist in storage | ||
| */ | ||
| RepairManifests repairFileEntries(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, for completness I wonder if we need to dinstinguish what to repair.. ie, file_size_in_bytes vs metrics ,etc. Can just read the filesystem metadata and save from reading the footer, for example. Probably it is too fine grained, and this option is fine?
api/src/main/java/org/apache/iceberg/actions/RepairManifests.java
Outdated
Show resolved
Hide resolved
| */ | ||
| RepairManifests dryRun(); | ||
|
|
||
| interface Result { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The one thing I feel is missing here is stats on how many entires were fixed for stats. Do we just assume everything was corrected? It's unclear from a dry run result if running the operation would result in any change if there are no duplicate/missing/recovered files, but you do need to run for missing stats.
Unfortunately, I would expect that fixed stats could be a very large number of files, which wouldn't work with the iterable model for the other categories of issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I don't think we should assume everything was corrected since it involves a footer read, some computation, and in all that flow things can go wrong in that for an individual file but not for another.
I've added two result methods for this:
/** Returns the number of manifest entries for which stats were incorrect */
long entryStatsIncorrect();
/** Returns the number of manifest entries for which stats were corrected */
long entryStatsRepairedCount();
I think it's important to distinguish between how many were repaired and how many were incorrect to begin with. WIth only the repair count, if it's 0 a user wouldn't be able to distinguish "was everything actually OK with the stats" vs "Hey a bunch of entry stats are incorrect but we couldn't actually do the repair".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've also removed "added manifests", because as I was working through the implementation a bit more, I came to the conclusion that it's not needed. I think it was just a copy from the rewrite manifests results but
my thinking is that for repairing we're always working with some existing manifests and producing new manifests with rewritten contents but the "source" is always some existing manifest for repair. For repair there isn't really an "added" manifest in the sense of how the rewrite procedure defines it. cc @danielcweeks @szehon-ho @RussellSpitzer . It's also something we could add later on with a default implementation if we realize we need it.
83ca336 to
8bd7c58
Compare
api/src/main/java/org/apache/iceberg/actions/RepairManifests.java
Outdated
Show resolved
Hide resolved
| */ | ||
| RepairManifests dryRun(); | ||
|
|
||
| interface Result { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I don't think we should assume everything was corrected since it involves a footer read, some computation, and in all that flow things can go wrong in that for an individual file but not for another.
I've added two result methods for this:
/** Returns the number of manifest entries for which stats were incorrect */
long entryStatsIncorrect();
/** Returns the number of manifest entries for which stats were corrected */
long entryStatsRepairedCount();
I think it's important to distinguish between how many were repaired and how many were incorrect to begin with. WIth only the repair count, if it's 0 a user wouldn't be able to distinguish "was everything actually OK with the stats" vs "Hey a bunch of entry stats are incorrect but we couldn't actually do the repair".
| */ | ||
| RepairManifests dryRun(); | ||
|
|
||
| interface Result { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've also removed "added manifests", because as I was working through the implementation a bit more, I came to the conclusion that it's not needed. I think it was just a copy from the rewrite manifests results but
my thinking is that for repairing we're always working with some existing manifests and producing new manifests with rewritten contents but the "source" is always some existing manifest for repair. For repair there isn't really an "added" manifest in the sense of how the rewrite procedure defines it. cc @danielcweeks @szehon-ho @RussellSpitzer . It's also something we could add later on with a default implementation if we realize we need it.
8bd7c58 to
cb297db
Compare
Co-authored-by: Mathew Fournier <[email protected]>
cb297db to
eb2041b
Compare
|
Hi, @amogh-jahagirdar @RussellSpitzer had a question. I think one motivation was to handle CheckSnapshotIntegrity API proposed in #10642 However in that pr, it is also checking metadata.jsons and manifest-list. for missing metadata-file locations. I think there are other cases it may be worth to repair, for example snapshot summary metadata can be repaired if they are ever out of sync (as the values are mostly updated incrementally and one bad update can cascade for subsequent snapshots). What do you guys think about a generic RepairTable, and have some analgous methods repairSnapshotSummaries or repairSnapshots ? Otherwise we may still need #10642 to do those checks |
|
Another one is : #10535 corrupt manifest file. It may make sense to repair those as well. So in summary:
Let me know what you think, if its within the scope here |
|
@szehon-ho I think I agree that the use cases presented are legitimate repair cases however I'm not sure that we should be combining all of them into a single That said one thing to consider is do we need these procedures to move in lockstep somehow to ensure that statistics/snapshot summaries roll up correctly? I don't think that's the case since in most cases we are producing new snapshots entirely after the repair and the deltas in rows/files added/deleted should be accumulated correctly? We'll probably need tests for that. I do think that we could add a More concretely at least what I'm thinking so far:
@szehon-ho I actually had a question on the snapshot repair, based on the description the goal of that is to repair snapshot summary stats which may have been corrupted. Doesn't that necessarily mean we must mutate the existing snapshot (and subsequent snapshots) to correct it? Overall though I'll def give the general |
|
Makes sense @amogh-jahagirdar !
i was just thinking we make a new snapshot with correct summary stats (only the totals). But yes you are right, it is open to interpretation, in this case you cant go back and fix wrong stats, so this particular feature probably does need more thought. |
Really sorry for the delayed response @szehon-ho I forgot I had this open. So I was discussing this with @rdblue and I am more convinced of your point that we may as well have a unified Furthermore, the procedure is traversing the entire metadata tree so practically there's probably overlap across the different repair operations. I can update the PR based on this |
|
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions. |
|
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
|
For my edification, can someone please explain how duplicate file entries in manifests can arise? Can two entries for the same file occur in a single manifest? Can even two manifests be in the same manifest list if they overlap (have an entry for the same file in common)? I'd have thought both of these situations would be bugs. Or are there actual sequences of operations that lead to such outcomes, similar to how dangling deletes can occur? |
Sure, so one example of an issue that happened in the past is that in the Kafka Connect integration we ended up appending the same file multiple times. We rectified that in Kafka connect and the Iceberg library for duplicate appends in the same snapshot, but it's still technically possible to append the same file across different snapshots (at least in the reference implementation and probably a few others). Detecting overlapping files involves an expensive read through current manifest(s) to deduplicate which if performed on every append would be prohibitively expensive for the operation. Overlapping files across manifests does imply a bug in whatever integration is writing to the table. However, even after those bugs are fixed, it still makes sense for Iceberg to expose repair procedures to correct those tables to unblock users from using their tables.
Yeah I think the same I mentioned above applies where some random implementation may end up writing incorrect statistics for whatever reason, and it'd be good for repair table to correct that since stats are something that can be deterministically corrected. |
|
Another question I have about duplicate entries (in the manifests) is: do their presence make the table unreadable? Or is the table still readable and it is a valid state, although undesirable (as is the case with dangling deletes, for example)? An observation is that the existing subinterfaces of |
|
Do we still pursue this? |
Yes, @amogh-jahagirdar is still looking at this and we continue to see use cases where this is necessary. |
|
@danielcweeks thanks for the confirmation. One of use cases is to check the integrity of a copied table from different places. The use cases are increasing as |
|
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions. |
|
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
|
I see this PR has been closed, which is sad. Should it be re-opened, @amogh-jahagirdar, @danielcweeks? I'd love to see |
As an enthusiastic user of the new rewriteTablePath, I would find this use case compelling. I didn't actually realize that was within the scope of this PR since the name implies a focus on "repair". Any ETA on when we could use it for this migration confirmation? |
I believe this depends on the reader. According to these docs it appears Snowflake will not read tables that have duplicates in the manifest files |
|
Ok this has been open for quite some time, and I haven't been able to follow through with this due to bandwidth but now I'm going to make time to follow through on this since I've received multiple reach outs on the usefulness of this procedure. From what I recall the pending decision and update had to do with making this a generic I will update this week when I get a chance, in the mean time for folks who have table with duplicate manifests can build and run from https://github.com/amogh-jahagirdar/iceberg/tree/repair-manifests-action |
Co-authored-by: Mathew Fournier [email protected]
Refer to #10445 for more details.
This change defines a RepairManifest Action interface where implementations should produce a new set of manifests where
duplicate entries should be removed, and entries which refer to files which no longer exist on disk should also be removed.